feat(benchmark): gateway benchmark harness (footprint, scaling, config, heap) by bburda · Pull Request #64 · selfpatch/selfpatch_demos

bburda · 2026-06-18T14:23:29Z

What this gives

A benchmark for the gateway's runtime cost that points at what to optimize AND tracks whether a change improved or regressed it. A Python orchestrator drives docker compose and samples the gateway process via /proc (USS/PSS, CPU-cores), with repeats and confidence intervals - not single readings.

python -m benchmark.benchmark footprint --duration 300 --repeats 5
python -m benchmark.benchmark scaling --entities 10,30,60,100,150,250
python -m benchmark.benchmark sweep --entities 50
python -m benchmark.benchmark load --entities 30
python -m benchmark.benchmark fault --faults 1,2,4,8,16
export ROS2_MEDKIT_REF=<sha>            # pin the gateway commit to benchmark
python -m benchmark.benchmark compare --run latest --baseline benchmark/baseline/<host>.json

Each lane writes a table, a chart, and a JSON summary with a verdict line. benchmark/README.md documents the method and metrics.
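As a rough illustration of the /proc sampling described above, the sketch below derives USS and PSS from the text of `/proc/<pid>/smaps_rollup`. It assumes the common definition USS = Private_Clean + Private_Dirty. The function names and sample values are hypothetical and are not taken from benchmark/lib/sampler.py, which may parse the file differently.

```python
def parse_smaps_rollup(text: str) -> dict[str, int]:
    """Collect every 'Name:   <n> kB' line into {name: KiB}."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # The first line of smaps_rollup is an address-range header; it does
        # not match the '<digits> kB' shape, so it is skipped here.
        if sep and len(parts) == 2 and parts[0].isdigit() and parts[1] == "kB":
            fields[name.strip()] = int(parts[0])
    return fields


def uss_pss_kib(text: str) -> tuple[int, int]:
    """Return (USS, PSS) in KiB; USS is assumed to be private clean + dirty."""
    fields = parse_smaps_rollup(text)
    uss = fields.get("Private_Clean", 0) + fields.get("Private_Dirty", 0)
    return uss, fields.get("Pss", 0)


# Made-up sample in the smaps_rollup layout, for illustration only.
SAMPLE = """\
55d0a0000000-7ffc00000000 ---p 00000000 00:00 0                          [rollup]
Rss:              120000 kB
Pss:              101000 kB
Private_Clean:      4000 kB
Private_Dirty:     94000 kB
"""

print(uss_pss_kib(SAMPLE))  # (98000, 101000)
```

In a live run the text would come from reading the file for the gateway PID inside the container, for example through docker exec.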

Lanes

footprint / scaling / sweep - steady-state memory and CPU on the real Nav2 demo and a synthetic graph; scaling fits the growth curve with a CI.
heap / memcheck - heaptrack heap growth + call-sites; valgrind definitely-lost. scripts/heap_on_nav2.sh runs a long heaptrack on the real Nav2 stack (debug-symbol gateway) - the tracked heap plateaus, so the gateway does not leak on Nav2.
load - footprint, CPU, thread breakdown and request p50/p95 latency under M concurrent HTTP clients (holds a real SSE stream; composes onto footprint via --load).
fault - snapshot-capture impact as a burst (peak memory/CPU, capture duration, recovery) vs fault count, a fresh container per N.

Regression tracking ("did we improve?")

Every run records the gateway SHA (ROS2_MEDKIT_REF, pinnable through Compose), demo image digest, host CPU/RAM/allocator, and a high-load flag.
compare diffs a run against a committed baseline/<host>.json, refuses cross-machine or high-load runs, and exits non-zero on regression (USS +10%, CPU +15%, scaling exponent CI crossing 1.0). update-baseline re-pins after a confirmed improvement.
A workflow_dispatch + weekly CI job (self-hosted runner) benchmarks a pinned medkit ref and fails on regression.
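The gate described above might look roughly like the sketch below. The thresholds (USS +10%, CPU +15%) come from the description; the dictionary keys, the function name, and the exit code used for a refused comparison are assumptions made for illustration. The scaling-exponent check is left out.

```python
# Relative increases over baseline that count as a regression (from the PR text).
USS_LIMIT = 0.10
CPU_LIMIT = 0.15


def compare(run: dict, baseline: dict) -> tuple[int, list[str]]:
    """Return (exit_code, messages): 0 = ok, 1 = regression, 2 = refused.

    Keys 'host', 'high_load', 'uss_mib' and 'cpu_cores' are hypothetical.
    """
    if run["host"] != baseline["host"]:
        return 2, ["refused: run and baseline come from different hosts"]
    if run.get("high_load"):
        return 2, ["refused: run was recorded under high host load"]

    messages = []
    for key, limit in (("uss_mib", USS_LIMIT), ("cpu_cores", CPU_LIMIT)):
        change = run[key] / baseline[key] - 1.0
        if change > limit:
            messages.append(f"{key} regressed by {change:+.1%} (limit {limit:+.0%})")
    return (1 if messages else 0), messages


baseline = {"host": "bench-01", "uss_mib": 100.0, "cpu_cores": 0.25}
run = {"host": "bench-01", "high_load": False, "uss_mib": 112.0, "cpu_cores": 0.26}

code, messages = compare(run, baseline)
print(code, messages)
```

Refusing to compare, rather than comparing with a warning, keeps a noisy or mismatched host from producing either a false regression or a false improvement.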

Built to not overclaim

Steady-state is enforced (a still-rising run is flagged not-steady and excluded; the report shows the steady/total count).
Scaling verdict is CI-gated (sub/super-linear only when the CI clears 1, else INDETERMINATE; degenerate small-n fits forced to INDETERMINATE; per refresh rate, not pooled).
The leak slope CI is autocorrelation-corrected; a positive /proc slope without heaptrack call-sites is inconclusive, not a leak; the heap lane discloses that /proc USS under heaptrack is inflated.
The fault lane uses a fresh container per N (clean baseline); the load lane reports median latency across repeats.
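To make the CI-gated scaling verdict concrete, here is a minimal sketch of the decision rule as described: a sub- or super-linear call is made only when the whole confidence interval sits on one side of 1, and small fits are forced to INDETERMINATE. The function name, the label strings other than INDETERMINATE, and the `min_points` cutoff are assumptions, not the harness's actual values.

```python
def scaling_verdict(ci_lo: float, ci_hi: float, n_points: int,
                    min_points: int = 4) -> str:
    """Classify a scaling exponent from its confidence interval."""
    # Too few points, or a malformed interval (including NaN bounds, for which
    # the comparison is False), cannot support a claim either way.
    if n_points < min_points or not (ci_lo <= ci_hi):
        return "INDETERMINATE"
    if ci_hi < 1.0:
        return "SUB_LINEAR"
    if ci_lo > 1.0:
        return "SUPER_LINEAR"
    return "INDETERMINATE"  # the interval straddles 1.0


print(scaling_verdict(0.26, 0.65, n_points=6))  # SUB_LINEAR
print(scaling_verdict(0.80, 1.20, n_points=6))  # INDETERMINATE
print(scaling_verdict(0.26, 0.65, n_points=2))  # INDETERMINATE
```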

What it found (one host, illustrative)

Footprint on real Nav2: gateway USS ~95-100 MiB, ~0.2-0.3 CPU-cores.
Scaling: USS ~ entities^0.46, CI [0.26, 0.65] - sub-linear confirmed.
Config: discovery refresh interval is the main CPU lever (200 ms ~4x the 1000 ms default).
Load: ~50 threads (~39 = executor + httplib pool), CPU 18x idle->heavy, p95 2.3 ms.
Fault: snapshot-capture peak grows monotonically with N (~0.5 -> ~5.8 MiB at N=16); recovers only for N<=2.
Heap: no leak on Nav2 (tracked heap plateaus over a 25-min heaptrack run).
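A figure such as "USS ~ entities^0.46" is a slope in log-log space. The sketch below shows one plain way to estimate such an exponent with ordinary least squares; it is an illustration with synthetic data, not the harness's own fit, and it omits the confidence interval that the report attaches to the exponent.

```python
import math


def loglog_exponent(xs: list[float], ys: list[float]) -> float:
    """Least-squares slope of log(y) against log(x), i.e. b in y ~ a * x**b."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx


# Entity counts from the scaling command; USS values are SYNTHETIC, generated
# from an exact power law so that the recovered exponent is known in advance.
entities = [10, 30, 60, 100, 150, 250]
uss_mib = [3.0 * n ** 0.5 for n in entities]

print(round(loglog_exponent(entities, uss_mib), 6))  # 0.5
```

With real measurements the points scatter around the line, which is why the verdict depends on the interval around the exponent rather than on the point estimate.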

Notes

Synthetic lanes run the gateway and graph (or fault_manager) in one container (the Docker bridge does not forward DDS multicast) and build a debug-symbol gateway image for heap/leak work. Runs on plain Docker (probing via docker exec also covers docker-out-of-docker). Unit tests: 158. The CI job needs a fixed self-hosted runner so the host-keyed baseline stays valid.

Related Issue

n/a

Checklist

Tested locally
README updated (benchmark/README.md)

Copilot

Pull request overview

Adds a new benchmark/ Python-based harness to measure the ROS2 Medkit gateway’s runtime cost (memory footprint, scaling behavior, config sweep impact, and heap/leak signals) by orchestrating Docker Compose runs and sampling /proc metrics, with accompanying report/chart generation and unit tests for the pure parsing/aggregation logic.

Changes:

Introduces a benchmark CLI (python -m benchmark.benchmark) with lanes: footprint, scaling, sweep, heap, memcheck, attribute, and report aggregation.
Adds a synthetic ROS 2 graph generator (rclpy) plus Docker Compose + Dockerfile tooling to run gateway + graph in a single container.
Adds a substantial pure-Python library layer for sampling/parsing/metrics/reporting, covered by unit tests and fixtures.

Reviewed changes

Copilot reviewed 48 out of 52 changed files in this pull request and generated 6 comments.

File	Description
benchmark/benchmark.py	Main CLI orchestrator for all benchmark lanes; run directory management, aggregation, and reporting.
benchmark/turtlebot3.py	Demo wiring/config for the turtlebot3 integration benchmark target.
benchmark/README.md	Usage docs, prerequisites, lane descriptions, and quickstart commands.
benchmark/requirements.txt	Python dependencies for report/chart generation and tests.
benchmark/.dockerignore	Excludes results and caches from Docker build context.
benchmark/configs/overrides.yaml	Config override sets used for the sweep lane.
benchmark/lib/config_sweep.py	Pure helpers for merging and applying param overrides at the gateway namespace root.
benchmark/lib/docker_helpers.py	Docker/Compose wrappers for starting services, exec’ing commands, and reading `/proc` files.
benchmark/lib/gateway_client.py	JSON parsing helper for collection endpoints (items-count).
benchmark/lib/leak_parse.py	Pure parsers for heaptrack and valgrind memcheck summaries.
benchmark/lib/metrics.py	Pure numeric/stat helpers (median/IQR/linfit/slope CI/log-log exponent/steady window).
benchmark/lib/report.py	Repeat aggregation + lane verdict logic and markdown/chart renderers.
benchmark/lib/runner.py	Shared “cell runner” logic: start container, warmup, sample window, summarize.
benchmark/lib/runmeta.py	Captures run metadata (host, kernel, allocator, image digest, etc.).
benchmark/lib/sampler.py	`/proc` sampling and parsing (`smaps_rollup`, `status`, `stat`) + CSV writing.
benchmark/lib/warmup.py	Warmup predicates (entity-count stability + USS derivative threshold).
benchmark/lib/__init__.py	Package marker for `benchmark.lib`.
benchmark/scaler/spawn_nodes.py	Synthetic graph planning (node/topic/service/param specs).
benchmark/scaler/synthetic_graph.py	rclpy-based synthetic graph host that publishes and exposes services.
benchmark/scaler/__init__.py	Package marker for `benchmark.scaler`.
benchmark/profiles/synthetic.compose.yml	Docker Compose profile to run gateway + synthetic graph in one container.
benchmark/profiles/Dockerfile.benchmark	Benchmark image build (ROS Jazzy, tools, clone/build ros2_medkit).
benchmark/profiles/run_gateway_and_graph.sh	Container entrypoint to start synthetic graph and gateway (optionally under heaptrack/valgrind).
benchmark/profiles/fastdds.supp	Valgrind suppressions for FastDDS-related shutdown noise.
benchmark/tests/test_cli_wiring.py	CLI help/subcommand presence test.
benchmark/tests/test_config_sweep.py	Unit tests for deep-merge and override application behavior.
benchmark/tests/test_docker_helpers.py	Unit tests for PID parsing error cases.
benchmark/tests/test_gateway_client.py	Unit tests for items-count JSON parsing.
benchmark/tests/test_heap_report.py	Unit tests for heap report markdown rendering.
benchmark/tests/test_leak_parse.py	Unit tests for heaptrack/memcheck summary parsing.
benchmark/tests/test_memcheck_report.py	Unit tests for memcheck report markdown rendering.
benchmark/tests/test_metrics.py	Unit tests for numeric/stat helpers.
benchmark/tests/test_overrides_load.py	Unit tests for loading override sets YAML.
benchmark/tests/test_report_aggregate.py	Unit tests for repeat aggregation + verdict helpers.
benchmark/tests/test_report_render.py	Unit tests for footprint markdown rendering formatting/contents.
benchmark/tests/test_runner_summary.py	Unit tests for window summarization output keys and sanity.
benchmark/tests/test_runmeta.py	Unit tests for required run metadata fields.
benchmark/tests/test_sampler_loop.py	Unit tests for sampling loop helpers and CPU-cores derivation.
benchmark/tests/test_sampler_parse.py	Unit tests for `/proc` parsing routines with fixtures.
benchmark/tests/test_scaler_plan.py	Unit tests for synthetic graph planning (counts, uniqueness, cardinality).
benchmark/tests/test_scaling_rows.py	Unit tests for scaling row derivation (USS per entity).
benchmark/tests/test_validation.py	Unit tests for synthetic-vs-demo validation messaging.
benchmark/tests/test_warmup.py	Unit tests for warmup predicate helpers.
benchmark/tests/fixtures/smaps_rollup.txt	Fixture for smaps_rollup parsing tests.
benchmark/tests/fixtures/stat.txt	Fixture for `/proc/<pid>/stat` parsing tests.
benchmark/tests/fixtures/status.txt	Fixture for `/proc/<pid>/status` parsing tests.
benchmark/tests/fixtures/memcheck.txt	Fixture for valgrind memcheck parsing tests.
benchmark/tests/fixtures/heaptrack_print.txt	Fixture for heaptrack_print parsing tests.
benchmark/tests/fixtures/medkit_params.yaml	Fixture for params override application tests.
benchmark/tests/__init__.py	Package marker for `benchmark.tests`.
benchmark/__init__.py	Package marker for `benchmark`.
.gitignore	Ignores benchmark results output, baked params, and benchmark pyc/caches.

Copilot

Pull request overview

Copilot reviewed 66 out of 71 changed files in this pull request and generated 6 comments.

Add a churn lane that gates gateway memory growth under ROS graph churn (static vs churning-graph USS slope, PASS/FAIL, exit 1 on leak), plus a synthetic-graph churn mode (BENCH_CHURN_SEC / BENCH_CHURN_COUNT).

Honesty and robustness fixes so lanes report real data instead of silent zeros:

- memcheck: run valgrind directly on gateway_node (not the ros2 launcher), capture stderr, gate on readiness, poll for the LEAK SUMMARY; fix the malformed fastdds.supp that made valgrind abort at startup
- heap: bash pipefail and fail loud when heaptrack produces no summary
- heap_on_nav2.sh: two-phase (clean USS without heaptrack as the leak verdict, short heaptrack pass for call-site attribution) with an OLS slope CI
- scaling regression gate: baseline-relative CI comparison instead of an absolute ci_lo>1 threshold; absolute floor only when baseline is absent
- compare: gate high host load across all lanes, not just the first
- docker_helpers: per-call subprocess timeouts and curl --max-time / --connect-timeout; accept any positive CLK_TCK; merge_stderr option
- sampler: tolerate transient /proc read errors, stop after the process is gone
- burst: take clk_tck as a parameter; require USS to leave the band before declaring recovery
- load_gen: include timed-out requests in tail latency, report error_rate, stop on SIGTERM
- fault lane: mark failed cells and exclude them from the table, chart and optimization signals instead of rendering fabricated zeros
- runner: warm the gateway under load for the load lane; treat an empty sample window as a failed cell
- report: leak verdict wording (leaked-at-exit, not "heap grew")
- cmd_report / _latest_run_dir: clear error on missing or empty results dir
- cmd_load: per-level thread census
- turtlebot3: override_root typed as list[str]

Tooling:

- --run-dir to write several lanes into one shared run dir for a single compare
- CI runs the harness unit tests on a GitHub-hosted runner
- portable test working directory; docs and unit tests for all of the above
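The load_gen item above (counting timed-out requests in tail latency) matters because dropping them makes p95 look better exactly when the gateway is struggling. A minimal sketch of the idea, assuming each timed-out request is scored at the timeout value and that timeouts are the only error type; the function and key names are hypothetical and not taken from the load generator.

```python
import math


def latency_summary(completed_ms: list[float], timeouts: int,
                    timeout_ms: float) -> dict[str, float]:
    """Nearest-rank p50/p95 over completed AND timed-out requests."""
    # Score every timed-out request at the timeout so it lands in the tail
    # instead of silently disappearing from the sample.
    samples = sorted(list(completed_ms) + [timeout_ms] * timeouts)
    total = len(samples)
    if total == 0:
        raise ValueError("no requests recorded")

    def percentile(p: int) -> float:
        rank = max(1, math.ceil(p * total / 100))  # nearest-rank method
        return samples[rank - 1]

    return {
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "error_rate": timeouts / total,
    }


# Made-up numbers: 18 fast responses and 2 requests that hit a 1000 ms timeout.
summary = latency_summary([1.0] * 18, timeouts=2, timeout_ms=1000.0)
print(summary)  # {'p50_ms': 1.0, 'p95_ms': 1000.0, 'error_rate': 0.1}
```

If the two timeouts were discarded, p95 over the remaining 18 samples would be 1.0 ms, which hides the failures entirely.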

mfaferek93

LGTM!

Copilot AI review requested due to automatic review settings June 18, 2026 14:23

Copilot started reviewing on behalf of bburda June 18, 2026 14:24

Copilot AI reviewed Jun 18, 2026

Comment thread benchmark/tests/test_cli_wiring.py Outdated

Comment thread benchmark/lib/docker_helpers.py

Comment thread benchmark/benchmark.py

Comment thread benchmark/lib/report.py Outdated

Comment thread benchmark/benchmark.py

Comment thread benchmark/turtlebot3.py Outdated

bburda force-pushed the feat/benchmark-harness branch 2 times, most recently from 6242b91 to b9997a7 Compare June 18, 2026 14:53

bburda marked this pull request as draft June 18, 2026 15:00

bburda force-pushed the feat/benchmark-harness branch 6 times, most recently from cea132c to 4d43cc1 Compare June 19, 2026 17:14

bburda self-assigned this Jun 19, 2026

bburda requested review from Copilot and mfaferek93 June 19, 2026 17:27

Copilot started reviewing on behalf of bburda June 19, 2026 17:27

Copilot AI reviewed Jun 19, 2026

Comment thread benchmark/tests/test_cli_wiring.py

Comment thread benchmark/benchmark.py

Comment thread benchmark/lib/fault_injector.py

Comment thread benchmark/lib/fault_injector.py

Comment thread benchmark/turtlebot3.py Outdated

Comment thread .github/workflows/benchmark.yml

bburda added 6 commits June 19, 2026 19:33

feat(benchmark): process sampling + statistics core

5a0e1b4

/proc USS/PSS/CPU sampling, the Student-t + AR(1) statistics engine, run metadata, config-override loading and leak/memcheck log parsing, with unit tests.

feat(benchmark): steady-state runner, aggregation/reporting, burst + …

724a3a6

…compare Fresh-container cell runner with the enforced warm-up gate, median/IQR aggregation and CI-gated report rendering, the transient burst sampler, and the baseline-diff engine.

feat(benchmark): synthetic graph generators, fault injector, containe…

7966f38

…r profiles Synthetic ROS 2 graph + HTTP load generator, the fault_manager injector, and the single-container gateway/graph/fault images and entrypoints.

feat(benchmark): CLI with footprint/scaling/sweep/heap/load/fault lanes

8fc5a03

The orchestrator CLI wiring every lane plus all/report, and the harness README documenting the method, metrics and lanes.

feat(benchmark): heap-on-Nav2 tooling

ab3e490

Rebuilds the gateway with debug symbols and runs it under heaptrack attached to the real Nav2 graph; the tracked heap plateaus, so the gateway does not leak on Nav2.

feat(benchmark): regression tracking - pin gateway SHA, baseline, CI

e8075e4

Pin the gateway commit via ROS2_MEDKIT_REF through Compose, capture the SHA in the demo image, seed a host-keyed baseline, and add a dispatch+weekly CI job that compares a run against it and fails on regression.

bburda force-pushed the feat/benchmark-harness branch from 4d43cc1 to e8075e4 Compare June 19, 2026 17:34

mfaferek93 reviewed Jun 20, 2026

bburda marked this pull request as ready for review June 20, 2026 08:05

mfaferek93 approved these changes Jun 21, 2026

bburda merged commit 471ca2c into main Jun 21, 2026
5 checks passed

bburda deleted the feat/benchmark-harness branch June 21, 2026 09:21

3 participants
